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Block Algorithms for Quark Propagator Calculation 

Stephen M. Pickles^ , UKQCD Collaboration 

'^Department of Physics and Astronomy, University of Edinburgh, Edinburgh EH9 3JZ, Scotland 

Computing quark propagators in lattice QCD is equivalent to solving large, sparse linear systems with multiple 
right-hand sides. Block algorithms attempt to accelerate the convergence of iterative Krylov-subspace methods 
by solving the multiple systems simultaneously. This paper compares a block generalisation of the quasi-minimal 
residual method (QMR), Block Conjugate Gradient on the normal equation. Block Lanczos and (75-symmetric) 
Block BiConjugate Gradient. 



1. Introduction 

The calculation of quark propagators remains a 
major bottleneck in lattice QCD. The problem is 
to solve the matrix equation 

Mip = 77 (1) 

for ijj given several right-hand sides (sources) 
77. The fermion matrix M is non-hermitian (we 
use Wilson fermions with clover improvement), 
sparse and very large. Recent attempts to accel- 
erate the solution of Eq. (|l]) have focused on: 

1. improved iterative methods 

2. improved preconditioners, or 

3. solving related systems. 

There is growing consensus that the first of 
these areas is now mature |^,|[ , with BiCGSTAB 
emerging as the method to beat. 
Finding a better preconditioner is complicated 
by the now universal requirement of scalability 
on parallel computers; red-black preconditioning 
(used in the present study) is beginning to give 
way to LL-SSOR but the last word on pre- 
conditioning has not been said. 
The third area is the exploitation of information 
found in the solution of one system to accelerate 
the convergence of another. Keeping the source 
fixed and varying k leads to a family of multi- 
ple mass tricks Keeping the matrix fixed 
and varying the source leads to the ideas of defla- 
tion 1^ and block algorithms, the several systems 
being solved simultaneously in the latter but se- 
quentially in the former. 

Two properties of M are relevant here. 



1. M — 75Af^75 is 75-symmetric. This has 
been exploited to halve the computational 
costs of QMR and BiCG §. 

2. iM is a shifted matrix with respect to its 
dependence on Multiple-mass tricks de- 
pend on this property. 

Red-black preconditioning preserves property (^ , 
but destroys property (^) except in the unim- 
proved case of Csw — 0. 

2. Block Algorithms 

Using a Krylov subspace method s times to solve 



systems M^^^'' = 



(1) 



leads 



to the construction of several overlapping Krylov 
subspaces. In the worst case, ie. when the num- 
ber of iterations required for convergence equals 
the order N of M, the overlap will be complete. 
By solving the s systems simultaneously, block al- 
gorithms eliminate the redundant matrix-vector 
operations in the above approach. One assem- 
bles the s right-hand sides into an x s matrix 
H = {rj'-^\ ■ ■ ■ ,77'-*-') and solves 

= H (2) 
for ^ = (7/;(i),. .. , V^')). 

On the other hand, a perfect preconditioner (one 
which coincides exactly with M~^), solves the 
system with a single multiplication; in this case 
there is no gain to be had from the block algo- 
rithm. In practice we hope to have good precondi- 
tioners, so that we usually solve the point (s — 1) 
problem to the desired accuracy in much less than 
A^ multiplications. These considerations lead one 
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to expect that blocking will be most effective on 
badly conditioned systems and/or small volumes, 
ie. when the number of iterations required for the 
point algorithm to converge is comparable to N. 
Blocking introduces certain overheads. The first 
is memory; the storage requirements for vectors 
in the point algorithm are multiplied by s when 
going to the block algorithm. Secondly, vector- 
vector operations such as 

y = ax + y and /3 = y^x (3) 

in the point algorithm generalise to 

Y = Xa + Y and (3 = (4) 

where X and Y are N x s matrices and a and 
(3 have become s x s matrices. Thus the number 
of vector- vector operations required per iteration 
scales as s^. 

2.1. B-CGNR and B-Lanczos 

Block Conjugate Gradient Q can be applied to 
the normal equation M^M'^ = M^H. Increasing 
s yields a clear improvement in convergence, but 
not enough to defray the cost of squaring the con- 
dition number. Reference studied a method 
based on the hcrmitian block Lanczos process and 
applied it to 75 M^* = 75 i?. B-Lanczos clearly 
outperforms B-CGNR. Unfortunately, 75 is a bad 
preconditioncr for Eq. 

2.2. Block B-BiCG(75) 

The algorithm presented here is a special case of, 
and easily derived from, the Block Bi-Conjugate 
Gradient algorithm of O'Lcary 0. I have used 
75-symmetry to eliminate multiplications by M\ 
and have followed her important suggestion of or- 
thonormalising the columns of Pk. 

B-BiCG(75) 

Rq = H - M*o 
Pq = ^5 75^0 

PqSo = Ra (5) 
for fc = 0, 1, 2, . . . until convergence do { 
T = MPu 



Pk+i 



Rk+i = —Tak + Rk 



Pk = SkPk ^Pk 
T = Rk+i 

Pk+lSk+l 



k+ii^Rk+i 

1 

Pkpk 

T 



} 



The operations M and 1 
tions, as are (|^), (g) and 



(6) 

are QR decomposi- 
below. 



(7) 



2.3. Block QMR 

The original QMR (Quasi-Minimal Residual) al- 
gorithm 1^ used blocks of variable sizes in look- 
ahead steps to avoid Lanczos breakdown. Boyse 
and Seidl ||T^ described a block version of QMR 
for complex symmetric matrices, using fixed- 
size blocks to accelerate convergence. Subse- 
quently Freund and Malhotra discovered a 
non-hermitian version. The version presented 
here was developed by me independently of [ pT| . 

B-QMR(75) 

Po - P-i = 1^0 - 
Co = fo-i = foo = 0; ao = d-i = da = I 
Ro^Vi= H - M*o 
Vipi = Vi 
fi = pi 

for fc = 1, 2, . . . until convergence do { 

4 = v^jsVk 

Pk = (>k-lPk^k 

T = MVk - Vk-iPk 

Vfe+i = T - VkUk 
Vk+iPk+i ~ Vfc+i 

^fc — &fc-2/3fc 

e/c = ak-idk-2Pk + h-icnk 
Cfc = Ck-idk-2Pk + dk-iak 

J \ Pk+i 
Pk = [Vk - Pk-iek - Pk-20kKk^ 

Tk = CLkTk] Tk+l = Ckfk 
*fe = + PkTk 



(8) 



a-k 

Ck 



bk 
dk 



(9) 
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The algorithm does not give a recurrence for the 
residual Rk, but the fact that Tr f^ffc is of the 
same order of magnitude as Tr R^Rk is useful in 
formulating a stopping criterion. 
This version does nothing to address the prob- 
lems of Lanczos breakdown and (near) linear de- 
pendence in the columns of Vk- These can be 
brought under control, as shown in 
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Convergence histories for B-BiCG75, B-QMR75. 
s = 2 (...), 3 (■-■), 4 (---), 6 (- -); 

K = 0.14, V = 8'^, l3 = 5.7, Csw = 1.5678. 

3. Conclusions 

Block algorithms can reduce the number of 
matrix- vector operations required for convergence 
at the expense of more vector-vector operations. 
This does not necessarily lead to a reduction in 
wall-clock time. 



B-QMR(75) and B-BiCG(75) are clearly faster 
than B-CGNR and B-Lanczos, and in some 
regimes (light masses and small volumes) they 
significantly outperform BiCGSTAB. 
However, on lattices of realistic size, the improve- 
ments from blocking are marginal at best. At 
(3 = 6.0 quenched, V = 16^ x 48, I found that 
s = 1 is near-optimal for both B-QMR(75) and 
B-BiCG(75), unless the configuration is excep- 
tional. At P = 5.2, Nf = 2, V = 123 X 24, I 
found s = 1 to be optimal for the same algorithms 
at all Kvaience and Ksea Combinations studied in 
jl^ ]. These conclusions should be reviewed if the 
relative cost of matrix- vector to vector- vector op- 
erations increases significantly. Block algorithms 
may yet have a role to play in conjunction with 
highly-improved actions on coarse lattices. 
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